I used .info() to see all of the columns and data types
Happiness Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 24 columns):

| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | Country | 4000 non-null | object |
| 1 | Year | 4000 non-null | int64 |
| 2 | Happiness_Score | 4000 non-null | float64 |
| 3 | GDP_per_Capita | 4000 non-null | float64 |
| 4 | Social_Support | 4000 non-null | float64 |
| 5 | Healthy_Life_Expectancy | 4000 non-null | float64 |
| 6 | Freedom | 4000 non-null | float64 |
| 7 | Generosity | 4000 non-null | float64 |
| 8 | Corruption_Perception | 4000 non-null | float64 |
| 9 | Unemployment_Rate | 4000 non-null | float64 |
| 10 | Education_Index | 4000 non-null | float64 |
| 11 | Population | 4000 non-null | int64 |
| 12 | Urbanization_Rate | 4000 non-null | float64 |
| 13 | Life_Satisfaction | 4000 non-null | float64 |
| 14 | Public_Trust | 4000 non-null | float64 |
| 15 | Mental_Health_Index | 4000 non-null | float64 |
| 16 | Income_Inequality | 4000 non-null | float64 |
| 17 | Public_Health_Expenditure | 4000 non-null | float64 |
| 18 | Climate_Index | 4000 non-null | float64 |
| 19 | Work_Life_Balance | 4000 non-null | float64 |
| 20 | Internet_Access | 4000 non-null | float64 |
| 21 | Crime_Rate | 4000 non-null | float64 |
| 22 | Political_Stability | 4000 non-null | float64 |
| 23 | Employment_Rate | 4000 non-null | float64 |

dtypes: float64(21), int64(2), object(1)
memory usage: 750.1+ KB
I used .describe() to see the range of values in each column and check that the data seems reasonable
Summary Statistics:
| | Year | Happiness_Score | GDP_per_Capita | Social_Support | Healthy_Life_Expectancy | Freedom | Generosity | Corruption_Perception | Unemployment_Rate | Education_Index | ... | Public_Trust | Mental_Health_Index | Income_Inequality | Public_Health_Expenditure | Climate_Index | Work_Life_Balance | Internet_Access | Crime_Rate | Political_Stability | Employment_Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | ... | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 |
| mean | 2014.670750 | 5.455005 | 30482.009953 | 0.505860 | 67.917605 | 0.502723 | 0.143960 | 0.498920 | 10.966748 | 0.750385 | ... | 0.502812 | 69.976853 | 40.002648 | 6.009270 | 65.176380 | 5.987325 | 67.586327 | 45.526322 | 0.494105 | 74.021450 |
| std | 5.724075 | 1.427370 | 17216.122032 | 0.286202 | 10.172091 | 0.285219 | 0.200088 | 0.288866 | 5.210712 | 0.144819 | ... | 0.289186 | 17.128536 | 11.634987 | 2.291172 | 19.981357 | 1.725363 | 15.769023 | 20.300069 | 0.293191 | 13.906888 |
| min | 2005.000000 | 3.000000 | 1009.310000 | 0.000000 | 50.000000 | 0.000000 | -0.200000 | 0.000000 | 2.000000 | 0.500000 | ... | 0.000000 | 40.000000 | 20.010000 | 2.010000 | 30.010000 | 3.000000 | 40.010000 | 10.030000 | 0.000000 | 50.000000 |
| 25% | 2010.000000 | 4.237500 | 15425.125000 | 0.260000 | 59.177500 | 0.260000 | -0.030000 | 0.240000 | 6.450000 | 0.630000 | ... | 0.260000 | 55.580000 | 29.865000 | 4.040000 | 48.170000 | 4.460000 | 53.910000 | 27.840000 | 0.230000 | 61.867500 |
| 50% | 2015.000000 | 5.430000 | 29991.255000 | 0.510000 | 68.015000 | 0.500000 | 0.140000 | 0.500000 | 10.995000 | 0.750000 | ... | 0.500000 | 69.650000 | 40.015000 | 6.070000 | 64.755000 | 6.020000 | 68.015000 | 45.760000 | 0.490000 | 74.475000 |
| 75% | 2020.000000 | 6.662500 | 45763.085000 | 0.750000 | 76.690000 | 0.750000 | 0.310000 | 0.742500 | 15.450000 | 0.880000 | ... | 0.760000 | 84.582500 | 50.187500 | 8.010000 | 82.652500 | 7.490000 | 81.332500 | 63.197500 | 0.760000 | 85.912500 |
| max | 2024.000000 | 8.000000 | 59980.720000 | 1.000000 | 85.000000 | 1.000000 | 0.500000 | 1.000000 | 19.990000 | 1.000000 | ... | 1.000000 | 100.000000 | 59.970000 | 10.000000 | 99.990000 | 9.000000 | 94.990000 | 79.990000 | 1.000000 | 98.000000 |
8 rows × 23 columns
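A minimal sketch of the .info() / .describe() check, assuming a small toy frame in place of the real data. The frame `df`, its values, and the `expected_ranges` bounds are illustrative assumptions, not the notebook's actual code.

```python
import pandas as pd

# Toy stand-in for the happiness data; values are illustrative only.
df = pd.DataFrame({
    "Country": ["USA", "France", "India", "USA"],
    "Year": [2005, 2010, 2015, 2024],
    "Happiness_Score": [3.0, 5.4, 6.7, 8.0],
    "Social_Support": [0.0, 0.26, 0.75, 1.0],
})

df.info()                # column names, non-null counts, and dtypes
summary = df.describe()  # count, mean, std, min, quartiles, max (numeric columns)

# Hypothetical plausibility bounds per column (assumed for this sketch).
expected_ranges = {"Happiness_Score": (3.0, 8.0), "Social_Support": (0.0, 1.0)}

# Count the rows that fall outside each bound (between() is inclusive).
out_of_range = {
    col: int((~df[col].between(lo, hi)).sum())
    for col, (lo, hi) in expected_ranges.items()
}

print(summary.loc[["min", "max"]])
print(out_of_range)
```

Comparing min and max against explicit bounds turns the visual check of the summary table into something that can be rerun if the data changes.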
I used .value_counts() to see how the data was spread across countries
Value Counts:
| Country | count |
|---|---|
| USA | 429 |
| France | 415 |
| Germany | 413 |
| Brazil | 404 |
| Australia | 400 |
| India | 399 |
| UK | 395 |
| Canada | 386 |
| South Africa | 385 |
| China | 374 |

Name: count, dtype: int64
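A short sketch of the per-country count, assuming a toy `countries` Series (the values are made up for illustration). Passing `normalize=True` gives shares instead of raw counts, which makes it easier to see whether the rows are spread evenly.

```python
import pandas as pd

# Toy Country column; the real data has 4000 rows across 10 countries.
countries = pd.Series(
    ["USA", "France", "USA", "India", "France", "USA"], name="Country"
)

counts = countries.value_counts()                # sorted from most to least frequent
shares = countries.value_counts(normalize=True)  # same ordering, as fractions of the total

print(counts)
print(shares.round(2))
```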
I used .hist() to see how each column's data is spread
I used scatter_matrix to look for obvious relationships between Happiness_Score and the other variables, or within subsets of variables. A single scatter_matrix of all columns was too hard to read because of the large number of variables, so I plotted subsets.
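One way to keep the plots readable is to draw histograms for all columns once, then build several small scatter matrices that each pair Happiness_Score with a few other columns. The sketch below assumes random toy data and a hypothetical `chunk_size` of 3; the actual subsets used in the notebook are not shown in the source.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Random toy data with a handful of the real column names.
rng = np.random.default_rng(0)
cols = ["Happiness_Score", "GDP_per_Capita", "Social_Support", "Freedom",
        "Generosity", "Crime_Rate", "Internet_Access"]
df = pd.DataFrame(rng.random((50, len(cols))), columns=cols)

# One histogram per numeric column.
hist_axes = df.hist(bins=20, figsize=(10, 8))

# Scatter matrices in chunks, always including the target column.
target = "Happiness_Score"
others = [c for c in cols if c != target]
chunk_size = 3  # assumed value; pick whatever stays legible
grids = []
for start in range(0, len(others), chunk_size):
    subset = [target] + others[start:start + chunk_size]
    axes = scatter_matrix(df[subset], figsize=(8, 8), diagonal="hist")
    grids.append(axes)
```

Each grid is then a 4 x 4 array of axes rather than one 23 x 23 grid.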
I used .isna() to find missing values (there were none)
There is no missing data.
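A minimal sketch of the missing-value check, assuming a toy frame with one deliberately missing value so that the count is visible. Summing the boolean mask from .isna() gives the number of missing entries per column.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (the real data has none).
df = pd.DataFrame({
    "Happiness_Score": [5.4, np.nan, 6.1],
    "Freedom": [0.5, 0.7, 0.2],
})

missing_per_column = df.isna().sum()           # count of NaN values in each column
total_missing = int(missing_per_column.sum())  # count across the whole frame

print(missing_per_column)
if total_missing == 0:
    print("There is no missing data.")
else:
    print(f"Missing values: {total_missing}")
```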
I used pandas' to_datetime to convert the Year column from int64 to datetime (I have to reverse this later)
New dtype for Year column: datetime64[ns]
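A sketch of the conversion and its reversal, assuming a toy `years` Series. Note that calling to_datetime on raw integers without a format treats them as nanoseconds since 1970, so this sketch converts to strings and passes `format="%Y"`. Whether the notebook did it exactly this way is an assumption.

```python
import pandas as pd

# Toy Year column stored as integers.
years = pd.Series([2022, 2015, 2009], name="Year")

# Parse each year as January 1 of that year.
as_dates = pd.to_datetime(years.astype(str), format="%Y")

# Reverse the conversion: pull the year back out as an integer.
back_to_int = as_dates.dt.year

print(as_dates.dtype)
print(back_to_int.tolist())
```

Keeping a separate integer year column alongside the datetime column avoids converting back and forth when a numeric x-axis is needed for a regression.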
I created a dataframe of continents, and merged on country
| | Country | Year | Happiness_Score | GDP_per_Capita | Social_Support | Healthy_Life_Expectancy | Freedom | Generosity | Corruption_Perception | Unemployment_Rate | Education_Index | Population | Urbanization_Rate | Life_Satisfaction | Public_Trust | Mental_Health_Index | Income_Inequality | Public_Health_Expenditure | Climate_Index | Work_Life_Balance | Internet_Access | Crime_Rate | Political_Stability | Employment_Rate | Continent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | China | 2022-01-01 | 4.39 | 44984.68 | 0.53 | 71.11 | 0.41 | -0.05 | 0.83 | 14.98 | 0.52 | 1311940760 | 78.71 | 8.88 | 0.34 | 76.44 | 46.06 | 8.92 | 62.75 | 8.59 | 74.40 | 70.30 | 0.29 | 61.38 | Asia |
| 1 | UK | 2015-01-01 | 5.49 | 30814.59 | 0.93 | 63.14 | 0.89 | 0.04 | 0.84 | 19.46 | 0.83 | 1194240877 | 50.87 | 5.03 | 0.72 | 53.38 | 46.43 | 4.43 | 53.11 | 8.76 | 91.74 | 73.32 | 0.76 | 80.18 | Europe |
| 2 | Brazil | 2009-01-01 | 4.65 | 39214.84 | 0.03 | 62.36 | 0.01 | 0.16 | 0.59 | 16.68 | 0.95 | 731100898 | 48.75 | 5.22 | 0.23 | 82.40 | 31.03 | 3.78 | 33.30 | 6.06 | 71.80 | 28.99 | 0.94 | 72.65 | South America |
| 3 | France | 2019-01-01 | 5.20 | 30655.75 | 0.77 | 78.94 | 0.98 | 0.25 | 0.63 | 2.64 | 0.70 | 1293957314 | 81.78 | 5.69 | 0.68 | 46.87 | 57.65 | 4.43 | 90.59 | 6.36 | 86.16 | 45.76 | 0.48 | 55.14 | Europe |
| 4 | China | 2022-01-01 | 7.28 | 30016.87 | 0.05 | 50.33 | 0.62 | 0.18 | 0.92 | 7.70 | 0.92 | 1432971455 | 82.39 | 6.33 | 0.50 | 60.38 | 28.54 | 7.66 | 59.33 | 3.00 | 71.10 | 65.67 | 0.12 | 51.55 | Asia |
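A minimal sketch of the continent merge, using the first five rows shown above. The `continents` lookup is an assumption: the source only shows Asia, Europe, and South America, so the labels for the other countries (for example "Oceania" for Australia) are my guess at what was used.

```python
import pandas as pd

# First five rows of the data, reduced to two columns.
df = pd.DataFrame({
    "Country": ["China", "UK", "Brazil", "France", "China"],
    "Happiness_Score": [4.39, 5.49, 4.65, 5.20, 7.28],
})

# Hypothetical lookup table: one row per country.
continents = pd.DataFrame({
    "Country": ["USA", "France", "Germany", "Brazil", "Australia",
                "India", "UK", "Canada", "South Africa", "China"],
    "Continent": ["North America", "Europe", "Europe", "South America", "Oceania",
                  "Asia", "Europe", "North America", "Africa", "Asia"],
})

# Left merge keeps every data row; validate checks the lookup has unique countries.
merged = df.merge(continents, on="Country", how="left", validate="many_to_one")

# Any country missing from the lookup would show up as NaN here.
unmatched = int(merged["Continent"].isna().sum())
print(merged)
print(f"Rows without a continent: {unmatched}")
```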
Questions:
Happiness Score vs Year, linear regression fit: Slope: 0.001, R-squared: 0.000, P-value: 0.928
Happiness Score vs Employment Rate, linear regression fit: Slope: 0.010, R-squared: 0.008, P-value: 0.064
Happiness Score vs Healthy Life Expectancy (USA, 2021-2024), linear regression fit: Slope: 0.046, R-squared: 0.087, P-value: 0.004
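The slope, R-squared, and p-value above can be produced with `scipy.stats.linregress`. This is a sketch on made-up points, and using SciPy is an assumption, since the source does not say which library did the fit.

```python
import numpy as np
from scipy.stats import linregress

# Made-up points lying close to the line y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0 + np.array([0.1, -0.1, 0.0, 0.1, -0.1])

fit = linregress(x, y)
r_squared = fit.rvalue ** 2  # linregress reports r, so square it for R-squared

print(f"Slope: {fit.slope:.3f} R-squared: {r_squared:.3f} P-value: {fit.pvalue:.3f}")
```

R-squared is the share of variance the line explains, while the p-value tests whether the slope differs from zero. With enough points a fit can have a small p-value and still explain very little, so both numbers are worth reading together.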
I grouped by Country and Year, then calculated the mean for all numeric columns
| | Country | Year | Happiness_Score | GDP_per_Capita | Social_Support | Healthy_Life_Expectancy | Freedom | Generosity | Corruption_Perception | Unemployment_Rate | Education_Index | Population | Urbanization_Rate | Life_Satisfaction | Public_Trust | Mental_Health_Index | Income_Inequality | Public_Health_Expenditure | Climate_Index | Work_Life_Balance | Internet_Access | Crime_Rate | Political_Stability | Employment_Rate | Year_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Australia | 2005-01-01 | 5.430667 | 28886.978000 | 0.528667 | 68.128667 | 0.372000 | 0.181333 | 0.517333 | 11.041333 | 0.778000 | 7.037067e+08 | 59.478667 | 5.685333 | 0.512667 | 74.255333 | 41.165333 | 5.525333 | 64.565333 | 6.286667 | 71.750667 | 44.341333 | 0.552667 | 72.108667 | 2005.0 |
| 1 | Australia | 2006-01-01 | 6.030556 | 28936.431111 | 0.677222 | 63.465000 | 0.495000 | 0.100556 | 0.449444 | 8.176111 | 0.736667 | 7.845685e+08 | 61.441111 | 6.388333 | 0.481111 | 75.430000 | 36.857222 | 5.838333 | 69.659444 | 5.902222 | 67.400556 | 47.636111 | 0.516111 | 73.490556 | 2006.0 |
| 2 | Australia | 2007-01-01 | 5.600000 | 36317.625455 | 0.505000 | 65.911818 | 0.529545 | 0.167273 | 0.487273 | 9.251818 | 0.780455 | 6.854900e+08 | 64.544545 | 6.788636 | 0.530000 | 71.362727 | 38.910000 | 6.015000 | 66.739091 | 5.937273 | 68.869545 | 46.938636 | 0.564545 | 73.953636 | 2007.0 |
| 3 | Australia | 2008-01-01 | 5.868889 | 27944.319444 | 0.467778 | 69.021111 | 0.616667 | 0.170000 | 0.480556 | 13.337222 | 0.696667 | 8.398988e+08 | 64.130556 | 6.635556 | 0.530000 | 60.281667 | 33.376667 | 6.148333 | 59.651111 | 5.455000 | 73.648889 | 47.990556 | 0.658333 | 69.658333 | 2008.0 |
| 4 | Australia | 2009-01-01 | 5.182857 | 29764.803810 | 0.524762 | 69.346667 | 0.574762 | 0.088571 | 0.549048 | 11.704762 | 0.756667 | 6.325921e+08 | 60.689524 | 6.563333 | 0.436667 | 66.839048 | 47.194286 | 6.183810 | 55.751905 | 6.234286 | 69.601905 | 37.245238 | 0.486190 | 73.119048 | 2009.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | USA | 2020-01-01 | 5.271667 | 28603.595417 | 0.494167 | 68.694167 | 0.533333 | 0.162083 | 0.441667 | 13.001667 | 0.792500 | 7.466417e+08 | 59.628333 | 6.985000 | 0.479167 | 68.115833 | 37.422083 | 5.592083 | 55.560000 | 6.275417 | 71.215417 | 44.359583 | 0.460833 | 76.734167 | 2020.0 |
| 196 | USA | 2021-01-01 | 6.193478 | 27768.738261 | 0.629565 | 67.822609 | 0.413913 | 0.157391 | 0.520000 | 12.670870 | 0.706522 | 8.094940e+08 | 55.241739 | 6.189565 | 0.471304 | 76.129565 | 43.584783 | 5.960435 | 67.486087 | 6.153478 | 65.183478 | 52.734348 | 0.489565 | 77.133478 | 2021.0 |
| 197 | USA | 2022-01-01 | 5.830370 | 32318.473704 | 0.459630 | 69.417778 | 0.566296 | 0.140000 | 0.521481 | 9.208889 | 0.743704 | 7.504774e+08 | 62.747778 | 6.675556 | 0.428148 | 78.861852 | 42.779630 | 5.602963 | 64.268889 | 6.167037 | 61.077407 | 38.973333 | 0.495556 | 71.130741 | 2022.0 |
| 198 | USA | 2023-01-01 | 5.465714 | 34045.116667 | 0.534286 | 61.773333 | 0.560476 | 0.168571 | 0.503810 | 11.015238 | 0.729524 | 7.384771e+08 | 56.519048 | 6.005714 | 0.464762 | 66.837619 | 41.858095 | 6.668095 | 60.468095 | 6.246190 | 65.696667 | 42.794762 | 0.423333 | 72.164286 | 2023.0 |
| 199 | USA | 2024-01-01 | 5.072857 | 34432.532857 | 0.480952 | 67.951429 | 0.449524 | 0.091429 | 0.662857 | 10.604762 | 0.741905 | 8.471548e+08 | 63.304286 | 6.712857 | 0.440476 | 69.573810 | 36.480952 | 7.103333 | 60.900000 | 5.776190 | 71.822857 | 42.460952 | 0.519524 | 73.016667 | 2024.0 |
200 rows × 25 columns
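A sketch of the grouping step on toy data. The values are made up, and `rows_per_group` is an extra check I added to show how many rows share each country-year pair before they are averaged.

```python
import pandas as pd

# Toy data with two rows for the same country and year.
df = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "UK"],
    "Year": pd.to_datetime(["2021", "2021", "2022", "2021"], format="%Y"),
    "Happiness_Score": [6.0, 7.0, 5.0, 4.0],
    "Freedom": [0.4, 0.6, 0.5, 0.9],
})

# How many rows fall into each (Country, Year) pair.
rows_per_group = df.groupby(["Country", "Year"]).size()

# One row per pair, holding the mean of every numeric column.
yearly = df.groupby(["Country", "Year"], as_index=False).mean(numeric_only=True)

print(rows_per_group)
print(yearly)
```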
This data set contains multiple data points per country per year, and those rows do not share values in any of the other columns. The data may be an amalgamation of other data sets, or a column may be missing that would explain the multiple points within a year (for example, different groups surveyed, different regions within a country, or different time points within a year).
No column shows a clear overall correlation with happiness score: the fits against Year (R-squared 0.000, p = 0.928) and Employment Rate (R-squared 0.008, p = 0.064) explain almost none of the variance. Restricting the data to the USA in 2021-2024 gives a slight positive correlation between healthy life expectancy and happiness score (slope 0.046, R-squared 0.087, p = 0.004), with both going down together over that period.
Averaged over 2005-2024, happiness score is highest for Australia and lowest for the UK, and the USA ranks 3rd of the 10 countries.
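A ranking like the one described can be computed by averaging Happiness_Score per country and sorting. This sketch uses made-up scores for four countries, so its ordering only illustrates the method and does not reproduce the notebook's result.

```python
import pandas as pd

# Made-up scores, two rows per country.
scores = pd.DataFrame({
    "Country": ["Australia", "Australia", "USA", "USA",
                "UK", "UK", "Canada", "Canada"],
    "Happiness_Score": [6.0, 7.0, 5.0, 6.0, 4.0, 5.0, 6.0, 6.0],
})

# Mean score per country, highest first.
ranking = (
    scores.groupby("Country")["Happiness_Score"]
    .mean()
    .sort_values(ascending=False)
)

# 1 = happiest; ties share the better position.
position = ranking.rank(ascending=False, method="min").astype(int)

print(ranking)
print(position)
```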